Abstract

In this project we work with the assigned “20 Newsgroups” dataset. We first preprocess the dataset by cleaning the text and building a vocabulary, and then visualize a set of summary statistics for the preprocessed data. Next, we train an LDA (latent Dirichlet allocation) model on the dataset and create a vector representation of the documents by training a Doc2Vec model. The report also compares and contrasts the visualizations produced by these methods.
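
As a rough illustration of the cleaning and vocabulary-building step, the following is a minimal sketch and not the pipeline actually used in this project. The function names `clean` and `build_vocabulary`, the alphabetic-token rule, and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters; the project's real rule may differ.
TOKEN_RE = re.compile(r"[a-z]+")


def clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(docs, min_count=1):
    """Map every token seen at least `min_count` times to an integer id."""
    counts = Counter(tok for doc in docs for tok in clean(doc))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


# Toy documents, invented for illustration only.
docs = ["The X server crashed!", "Restart the server, then the client."]
vocab = build_vocabulary(docs, min_count=2)
print(vocab)
```

Raising `min_count` drops rare tokens, which is one common way to keep the vocabulary small before training topic or embedding models.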

Document Stats

category                  docCount  sentCount  wordCount  numUniqueWords  meanSentLength  minSentLength  maxSentLength  stdSentLength
comp.windows.x                 593       9166      75989            4992        8.290312              1            204       8.721868
comp.os.ms-windows.misc        591      10398     119751            4505       11.516734              1           1170      43.798733
talk.politics.misc             465      11603      91194            7575        7.859519              1            117       7.192296
comp.sys.ibm.pc.hardware       590       6919      50143            4485        7.247146              1            185       7.490722
talk.religion.misc             377       7843      57363            6738        7.313911              1            174       6.795975
rec.autos                      594       7645      58613            5815        7.666841              1            117       6.981358
sci.space                      593       9366      86337            7516        9.218129              1            606      11.898430
talk.politics.guns             546      11615      90137            7363        7.760396              1            130       7.111474
alt.atheism                    480      10094      70099            6785        6.944621              1             75       5.678931
misc.forsale                   585       5031      40633            5295        8.076526              1            321      13.065767
comp.graphics                  584       7574      59159            5529        7.810800              1            177       7.774651
sci.electronics                591       7299      55248            5510        7.569256              1            142       6.635209
sci.crypt                      595      12105     105026            7202        8.676250              1            279       8.544165
soc.religion.christian         599      12869      93529            7714        7.267775              1             83       5.540344
rec.sport.hockey               600       9879      72776            5561        7.366738              1            527      11.625689
sci.med                        594      10108      82824            8107        8.193906              1            412       8.715710
rec.motorcycles                598       7393      53783            6073        7.274855              1             85       6.663944
comp.sys.mac.hardware          578       6368      45276            4441        7.109925              1             88       6.234718
talk.politics.mideast          564      16341     122177            8161        7.476715              1            284       6.935771
rec.sport.baseball             597       8268      55240            4931        6.681180              1            158       7.167797
total                        11314     181160    1357526           23515        7.493519              1           1159      12.828900
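
The per-category columns above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the code that produced the table: the function name `category_stats`, the punctuation-based sentence splitter, the token pattern, and the use of the population standard deviation are all assumptions. The only property taken from the table itself is that `meanSentLength` equals `wordCount / sentCount`.

```python
import re
import statistics

# Assumptions: sentences end at '.', '!' or '?'; words are runs of letters/apostrophes.
SENT_SPLIT_RE = re.compile(r"[.!?]+")
TOKEN_RE = re.compile(r"[A-Za-z']+")


def category_stats(docs):
    """Summarize sentence lengths (in words) over all documents of one category."""
    sent_lengths = []
    vocab = set()
    for doc in docs:
        for sent in SENT_SPLIT_RE.split(doc):
            tokens = [t.lower() for t in TOKEN_RE.findall(sent)]
            if tokens:  # skip empty fragments left by the splitter
                sent_lengths.append(len(tokens))
                vocab.update(tokens)
    if not sent_lengths:
        raise ValueError("no sentences found")
    return {
        "docCount": len(docs),
        "sentCount": len(sent_lengths),
        "wordCount": sum(sent_lengths),
        "numUniqueWords": len(vocab),
        "meanSentLength": statistics.mean(sent_lengths),
        "minSentLength": min(sent_lengths),
        "maxSentLength": max(sent_lengths),
        # Assumption: population standard deviation; the table may use the sample form.
        "stdSentLength": statistics.pstdev(sent_lengths),
    }


# Toy input, invented for illustration only.
stats = category_stats(["Hello world. This is a test!", "One more sentence here."])
print(stats)
```

Because empty fragments are skipped, every counted sentence has at least one word, which is consistent with the `minSentLength` of 1 reported for every category.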

LDA Visualization

K-means Visualization

Document to Vector Representation